
ggml: Implement yield barrier using futex for improved thread scheduling efficiency #13079


Open · SongXiaoXi wants to merge 1 commit into master from yield_barrier

Conversation

@SongXiaoXi (Contributor)

Description:
This PR replaces the original spin-based barrier in GGML with a futex-based yield barrier to improve thread scheduling efficiency and overall system performance (a minimal sketch of the mechanism is shown after the list of benefits below).

Currently, the feature can be controlled using the CMake parameter GGML_YIELD_BARRIER, allowing users to enable or disable the yield barrier as needed.

Key Benefits:

  1. Improved Scalability
    The futex-based barrier allows threads to yield instead of busy-waiting. This reduces CPU waste and improves scalability when the number of threads exceeds the number of physical cores, or when other workloads are competing for CPU time.

  2. Better Performance on Hybrid Architectures
    On systems with heterogeneous cores (e.g., big.LITTLE or Intel Hybrid Architecture), yielding helps critical threads get scheduled on performance cores, potentially improving throughput (e.g., PP performance in multi-threaded inference).

  3. Power Efficiency and Thermal Stability
    By avoiding unnecessary spinning, this change can reduce power consumption and help maintain higher sustained performance, especially on thermally constrained devices. It may also mitigate CPU throttling under load.
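
To make the mechanism concrete, here is a minimal sketch of what a futex-based barrier can look like on Linux. It illustrates the idea rather than the code in this PR: the names are made up, the actual patch spins briefly before sleeping, and non-Linux platforms need a different wait primitive.

```c
// Minimal sketch of a futex-based barrier (Linux-only, illustrative names;
// not the implementation in this PR). Waiting threads sleep in the kernel
// via futex() instead of busy-waiting, and the last thread to arrive wakes
// them all and starts the next "generation" of the barrier.
#include <stdatomic.h>
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct {
    atomic_int n_arrived;   // threads that have reached the barrier this round
    atomic_int generation;  // round counter, also used as the futex word
    int        n_threads;   // number of participating threads
} yield_barrier;

static void yield_barrier_wait(yield_barrier * b) {
    int gen = atomic_load_explicit(&b->generation, memory_order_acquire);

    if (atomic_fetch_add_explicit(&b->n_arrived, 1, memory_order_acq_rel) + 1 == b->n_threads) {
        // last arrival: reset for the next round, publish the new generation, wake everyone
        atomic_store_explicit(&b->n_arrived, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&b->generation, 1, memory_order_release);
        syscall(SYS_futex, &b->generation, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
    } else {
        // sleep until the generation advances; spurious wake-ups are re-checked
        while (atomic_load_explicit(&b->generation, memory_order_acquire) == gen) {
            syscall(SYS_futex, &b->generation, FUTEX_WAIT_PRIVATE, gen, NULL, NULL, 0);
        }
    }
}
```

The important property is that a thread with nothing to do gives its core back to the scheduler, which is what helps on oversubscribed or hybrid systems; the cost is a syscall and a context switch on the wake-up path, which is likely what shows up in some of the tg numbers below.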

Benchmark:

Based on build 42eb248 (5025).

Apple M1 (4P+4E), Accelerate framework and Metal disabled

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 488.30 ± 28.06 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 108.54 ± 19.58 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 824.37 ± 7.58 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 62.45 ± 0.14 |

Apple M3 Pro (5P+6E), Accelerate framework and Metal disabled

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 72.28 ± 0.39 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 11.89 ± 0.42 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 91.85 ± 1.59 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 13.84 ± 0.20 |

Apple M4 (binary compiled natively on M1)

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 4 | pp512 | 15.33 ± 0.01 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 4 | tg128 | 4.85 ± 0.00 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 4 | pp512 | 15.32 ± 0.01 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 4 | tg128 | 4.73 ± 0.00 |

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 10 | pp512 | 27.93 ± 0.07 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 10 | tg128 | 5.98 ± 0.08 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 10 | pp512 | 28.38 ± 0.07 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 10 | tg128 | 6.10 ± 0.00 |

Snapdragon 888 (X1 + A78x3 + A55x4)

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 210.31 ± 3.34 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 39.36 ± 0.35 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 300.16 ± 5.45 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 14.33 ± 0.08 |

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | pp512 | 80.65 ± 8.72 |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | tg128 | 8.05 ± 0.05 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | pp512 | 95.31 ± 1.67 |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | tg128 | 6.45 ± 0.05 |

Snapdragon 6Gen1 (A78x4 + A55x4)

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 196.30 ± 0.58 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 30.97 ± 0.17 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | pp512 | 261.19 ± 2.26 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 8 | tg128 | 11.07 ± 0.11 |

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | pp512 | 79.43 ± 0.40 |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | tg128 | 5.78 ± 0.04 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | pp512 | 79.56 ± 0.34 |
| llama 1B F16 | 2.79 GiB | 1.50 B | CPU | 8 | tg128 | 4.45 ± 0.01 |

Ryzen 9950X (light thermal throttling observed)

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 16 | pp512 | 216.12 ± 0.17 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 16 | tg128 | 4.15 ± 0.00 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 16 | pp512 | 222.44 ± 2.12 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 16 | tg128 | 4.15 ± 0.00 |

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 32 | pp512 | 221.41 ± 2.07 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 32 | tg128 | 3.94 ± 0.00 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 32 | pp512 | 222.19 ± 4.64 |
| llama 8B F16 | 14.96 GiB | 8.03 B | CPU | 32 | tg128 | 3.76 ± 0.04 |

Ryzen 9950X (spin-based bottleneck: threads > cores)

before:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 33 | pp512 | 59.36 ± 0.43 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 33 | tg128 | 0.26 ± 0.00 |

after:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 33 | pp512 | 2052.45 ± 4.99 |
| qwen2 1B Q4_0 | 403.20 MiB | 630.17 M | CPU | 33 | tg128 | 47.28 ± 1.20 |

Conclusion:

Across most tested devices, the pp512 workload consistently benefits from the futex-based yield barrier, showing noticeable throughput improvements. This is especially evident on high-core-count or hybrid-core systems, where reduced spinning improves scheduling fairness and efficiency.

However, for tg128 — which is typically less compute-intensive and more sensitive to load imbalance — performance may degrade slightly in some cases. This is likely due to the lower thread saturation and increased context switching overhead introduced by yielding, which affects lighter workloads more noticeably.

@github-actions bot added the **ggml** label (changes relating to the ggml tensor library for machine learning) on Apr 23, 2025
@SongXiaoXi (Contributor Author)

Hi, I would like to ask for your opinion regarding the use of futex-based yield barriers versus traditional spin barriers.
While yielding improves scalability and efficiency on overloaded systems or hybrid architectures, it may introduce additional context-switching overhead for lighter workloads.

Would appreciate your thoughts on whether a yield-based approach is a good fit for GGML’s threading model on mobile devices or servers under heavy load.

Thank you for your consideration!

@slaren (Member) commented Apr 29, 2025

Hi, sorry for taking so long to respond to this. I think this is very interesting and definitely something that we should be working towards, but as it is, the performance hit during generation is in my opinion too high for this to be useful in its current state. Ideally, this should be something that is always enabled and is triggered automatically after spinning for a while. I understand that the code already does this, so I wonder if it is a matter of tuning. If I am not mistaken, the gcc openmp implementation does something like this as well. Hiding this behind a compile flag that is disabled by default is likely to result in this being dead code that very few people are going to use.

On a less important note, I was not able to replicate the results on an M3 Max. In my tests, this was always slower.

```sh
cmake_opts="-DGGML_METAL=OFF -DGGML_BLAS=OFF -DGGML_OPENMP=OFF -DGGML_YIELD_BARRIER=ON" scripts/compare-commits.sh master yield_barrier -m models/qwen2.5-coder-0.5b-instruct-q4_0.gguf -p 256 -n 64 -r 2 -t 4,8,10,12,13,14,15,16 --delay 15
```

| Model | Threads | Test | t/s master | t/s yield_barrier | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 4 | pp256 | 504.27 | 494.29 | 0.98 |
| qwen2 1B Q4_0 | 4 | tg64 | 189.37 | 95.92 | 0.51 |
| qwen2 1B Q4_0 | 8 | pp256 | 958.59 | 902.25 | 0.94 |
| qwen2 1B Q4_0 | 8 | tg64 | 244.28 | 64.59 | 0.26 |
| qwen2 1B Q4_0 | 10 | pp256 | 1160.34 | 1073.05 | 0.92 |
| qwen2 1B Q4_0 | 10 | tg64 | 241.42 | 50.81 | 0.21 |
| qwen2 1B Q4_0 | 12 | pp256 | 1371.02 | 1229.41 | 0.90 |
| qwen2 1B Q4_0 | 12 | tg64 | 236.30 | 42.52 | 0.18 |
| qwen2 1B Q4_0 | 13 | pp256 | 1306.81 | 1192.42 | 0.91 |
| qwen2 1B Q4_0 | 13 | tg64 | 48.94 | 34.78 | 0.71 |
| qwen2 1B Q4_0 | 14 | pp256 | 1370.33 | 1190.25 | 0.87 |
| qwen2 1B Q4_0 | 14 | tg64 | 42.34 | 29.45 | 0.70 |
| qwen2 1B Q4_0 | 15 | pp256 | 1409.30 | 1177.26 | 0.84 |
| qwen2 1B Q4_0 | 15 | tg64 | 147.98 | 26.58 | 0.18 |
| qwen2 1B Q4_0 | 16 | pp256 | 1388.65 | 1183.83 | 0.85 |
| qwen2 1B Q4_0 | 16 | tg64 | 113.04 | 24.09 | 0.21 |

@SongXiaoXi (Contributor Author)

"If I am not mistaken, the gcc openmp implementation does something like this as well."

You're right, GCC’s OpenMP implementation does something similar. For reference, here are a few relevant links:

I've also implemented a check for the number of affinity cores to avoid unnecessary spinning — particularly helpful on processes limited by cpuset.
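
For illustration, a check along those lines could look roughly like this on Linux (the helper name is hypothetical, not taken from the PR):

```c
// Rough sketch of an affinity-aware spin decision (hypothetical helper, not
// the PR's code): if the process is confined to fewer CPUs than it has
// worker threads, spinning only steals time from the threads that hold the
// actual work, so the barrier should go to sleep right away.
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

static bool barrier_should_spin(int n_threads) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return true; // affinity unknown: keep the default spinning behaviour
    }
    // spin only if every worker thread can run on its own CPU at the same time
    return CPU_COUNT(&set) >= n_threads;
}
```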

"I was not able to replicate the results on an M3 Max. In my tests, this was always slower."

Regarding your M3 Max (12P+4E, I guess) results:

  • A long-tail effect: once the 12 performance cores complete their tasks, only 4 of them are reassigned to handle the remaining workload from the slower efficiency cores, leaving the other 8 P-cores idle.
  • The tg phase in qwen2 1B Q4_0 is not compute-intensive enough, making thread scheduling overhead more noticeable.

"Hiding this behind a compile flag ... likely to result in dead code"

Completely agree. The goal is absolutely to make yield_barrier the default in the future. The flag is just a temporary measure while we sort out tuning for cases where generation throughput suffers significantly.

Below is a set of benchmark results from my M3 Pro (5P+6E). It shows that pp512 and pp256 consistently benefit from yield_barrier, while tg128 and tg64 performance drops, especially at higher thread counts — supporting the idea that automatic tuning (e.g., adjusting threads per phase) might be a better long-term solution. According to your results, even with the spin policy, the best performance is achieved with 8 threads, not more.
To fully support hybrid-core CPUs more efficiently, it might be worth considering a work-stealing task queue — but that is still a long way off.

| Model | Threads | Test | t/s master | t/s yield_barrier | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 5 | pp512 | 1233.17 | 1207.73 | 0.98 |
| qwen2 1B Q4_0 | 5 | tg128 | 221.44 | 96.06 | 0.43 |
| qwen2 1B Q4_0 | 6 | pp512 | 700.10 | 941.10 | 1.34 |
| qwen2 1B Q4_0 | 6 | tg128 | 202.61 | 64.65 | 0.32 |
| qwen2 1B Q4_0 | 7 | pp512 | 766.82 | 1076.56 | 1.40 |
| qwen2 1B Q4_0 | 7 | tg128 | 210.89 | 57.33 | 0.27 |
| qwen2 1B Q4_0 | 8 | pp512 | 865.30 | 1185.22 | 1.37 |
| qwen2 1B Q4_0 | 8 | tg128 | 210.73 | 55.28 | 0.26 |
| qwen2 1B Q4_0 | 9 | pp512 | 931.69 | 1262.22 | 1.35 |
| qwen2 1B Q4_0 | 9 | tg128 | 206.39 | 45.75 | 0.22 |
| qwen2 1B Q4_0 | 10 | pp512 | 973.63 | 1308.40 | 1.34 |
| qwen2 1B Q4_0 | 10 | tg128 | 192.60 | 45.90 | 0.24 |
| qwen2 1B Q4_0 | 11 | pp512 | 888.40 | 1244.00 | 1.40 |
| qwen2 1B Q4_0 | 11 | tg128 | 150.34 | 41.88 | 0.28 |

| Model | Threads | Test | t/s master | t/s yield_barrier | Speedup |
| --- | --- | --- | --- | --- | --- |
| qwen2 1B Q4_0 | 5 | pp256 | 1442.20 | 1390.71 | 0.96 |
| qwen2 1B Q4_0 | 5 | tg64 | 208.26 | 99.09 | 0.48 |
| qwen2 1B Q4_0 | 6 | pp256 | 724.18 | 1072.99 | 1.48 |
| qwen2 1B Q4_0 | 6 | tg64 | 202.30 | 65.29 | 0.32 |
| qwen2 1B Q4_0 | 7 | pp256 | 805.83 | 1197.01 | 1.49 |
| qwen2 1B Q4_0 | 7 | tg64 | 209.09 | 59.03 | 0.28 |
| qwen2 1B Q4_0 | 8 | pp256 | 937.62 | 1306.49 | 1.39 |
| qwen2 1B Q4_0 | 8 | tg64 | 211.49 | 55.08 | 0.26 |
| qwen2 1B Q4_0 | 9 | pp256 | 1026.91 | 1388.99 | 1.35 |
| qwen2 1B Q4_0 | 9 | tg64 | 204.45 | 49.90 | 0.24 |
| qwen2 1B Q4_0 | 10 | pp256 | 1044.71 | 1457.37 | 1.39 |
| qwen2 1B Q4_0 | 10 | tg64 | 180.48 | 45.55 | 0.25 |
| qwen2 1B Q4_0 | 11 | pp256 | 942.47 | 1322.39 | 1.40 |
| qwen2 1B Q4_0 | 11 | tg64 | 143.43 | 42.41 | 0.30 |

@slaren (Member) commented Apr 30, 2025

> To fully support hybrid-core CPUs more efficiently, it might be worth considering a work-stealing task queue

Wouldn't this be the same that is already implemented for mul_mat and mul_mat_id?

```c
current_chunk = atomic_fetch_add_explicit(&params->threadpool->current_chunk, 1, memory_order_relaxed);
```

However, this is not supported when repacking Q4_0, since it uses a different implementation of the matrix multiplication functions.

```cpp
bool compute_forward(struct ggml_compute_params * params, struct ggml_tensor * op) override {
```
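
In sketch form, that chunk scheduling is essentially the following (a simplified illustration of the chunk hand-out, not the exact ggml code):

```c
// Simplified illustration of the chunk hand-out used by ggml's mul_mat (not
// a verbatim copy): every thread atomically claims the next chunk index, so
// faster cores naturally process more chunks than slower ones.
#include <stdatomic.h>

typedef struct {
    atomic_int current_chunk; // next chunk index to hand out
    int        n_chunks;      // total number of chunks for this op
} chunk_queue;

static void process_chunks(chunk_queue * q, void (*do_chunk)(int chunk)) {
    for (;;) {
        int chunk = atomic_fetch_add_explicit(&q->current_chunk, 1, memory_order_relaxed);
        if (chunk >= q->n_chunks) {
            break; // no work left: proceed to the barrier
        }
        do_chunk(chunk);
    }
}
```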

@SongXiaoXi (Contributor Author) commented Apr 30, 2025

> Wouldn't this be the same that is already implemented for mul_mat and mul_mat_id?

Ah, I see — I missed this part, you're right. That does implement a similar chunk-level scheduling mechanism.

So the performance regressions I'm seeing probably come down to the spin count not being tuned well — a smarter, adaptive spin-wait threshold is likely needed to reduce the cost of falling back to futex() syscalls. I'll do more testing.
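
In rough sketch form, the kind of spin-then-sleep wait being discussed looks like this (Linux-only; the spin budget is an assumed constant and is precisely the part that would need adaptive tuning):

```c
// Illustrative spin-then-sleep wait (hypothetical, not code from this PR).
// Spin for a bounded number of iterations, and only fall back to the
// futex() syscall if the watched word still has not changed.
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_BEFORE_SLEEP 4096 // assumed tuning constant, the value that needs adapting

static void spin_then_wait(atomic_int * word, int observed) {
    for (int i = 0; i < SPIN_BEFORE_SLEEP; ++i) {
        if (atomic_load_explicit(word, memory_order_acquire) != observed) {
            return; // value changed while spinning: no syscall needed
        }
        // a cpu "pause"/yield hint would normally go here
    }
    while (atomic_load_explicit(word, memory_order_acquire) == observed) {
        syscall(SYS_futex, word, FUTEX_WAIT_PRIVATE, observed, NULL, NULL, 0);
    }
}
```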
